[CB] [Major] Add tensor parallelism #45821

Merged
remi-or merged 30 commits into main from cb-tp2 on May 18, 2026
remi-or (Collaborator) commented May 7, 2026

This PR adds support for TP in continuous batching. The major changes required to do this were:

  • add inter-process communication for request states
  • add per-TP-group seeding
  • add hints to prevent NCCL graph mixing
  • change the hash function to avoid Python's built-in hash, which is salted per process
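
To illustrate the last point: CPython salts `hash()` for `str` and `bytes` objects per process unless `PYTHONHASHSEED` is fixed, so two TP ranks can disagree on the hash of the same input. Below is a minimal sketch of a process-independent alternative built on `hashlib`. The name `stable_block_hash` and the idea of chaining a parent hash with token ids are assumptions made for illustration, not the function this PR introduces.

```python
import hashlib
import struct
from typing import List, Optional


def stable_block_hash(parent_hash: Optional[int], token_ids: List[int]) -> int:
    """Hypothetical helper: hash a block of token ids identically in every process."""
    hasher = hashlib.blake2b(digest_size=8)
    # Mix in the parent hash, or a fixed marker when the block has no parent
    if parent_hash is None:
        hasher.update(b"root")
    else:
        hasher.update(struct.pack("<Q", parent_hash))
    # Pack each token id as a fixed-width signed 64-bit integer
    hasher.update(struct.pack(f"<{len(token_ids)}q", *token_ids))
    # The 8-byte digest maps to an unsigned 64-bit integer
    return int.from_bytes(hasher.digest(), "little")


first_block = stable_block_hash(None, [1, 2, 3])
second_block = stable_block_hash(first_block, [4, 5, 6])
```

Because the digest depends only on the bytes fed to it, every rank computes the same value for the same block, with no dependence on the interpreter's hash seed.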

It also adds a mechanism to the benchmark script to check that the generated output is coherent.
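
The per-TP-group seeding mentioned in the list above can be sketched as follows. The likely motivation is that, when sampling, all ranks of one TP group need to draw the same token, while separate groups may use different seeds. The helper `tp_group_seed` and the contiguous rank layout are assumptions for illustration only. In a real run the resulting seed would feed the framework RNG (for example `torch.manual_seed`) rather than Python's `random`.

```python
import random


def tp_group_seed(base_seed: int, global_rank: int, tp_size: int) -> int:
    """Hypothetical helper: one seed per TP group, shared by all ranks in the group."""
    # Assumed layout: ranks [0, tp_size) form group 0, [tp_size, 2 * tp_size) form group 1, ...
    tp_group_index = global_rank // tp_size
    return base_seed + tp_group_index


# Ranks 0 and 1 share a TP group when tp_size == 2, so their generators stay in sync
rng_rank0 = random.Random(tp_group_seed(42, global_rank=0, tp_size=2))
rng_rank1 = random.Random(tp_group_seed(42, global_rank=1, tp_size=2))
same_draws = rng_rank0.random() == rng_rank1.random()
```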

Performance

| Benchmark | TP1 main tok/s | TP1 tok/s | TP2 tok/s | Speedup | TP1 acc | TP2 acc |
|---|---|---|---|---|---|---|
| gsm8k_default | 2,454 | 2,454 | 3,657 | 1.49× | 0.822 | 0.819 |
| gsm8k_sampling | 1,940 | 1,942 | 2,616 | 1.35× | 0.792 | 0.775 |
| gsm8k_compile | 2,463 | 2,467 | 3,689 | 1.50× | 0.822 | 0.821 |
| gsm8k_no_fast_decode | 2,367 | 2,370 | 3,522 | 1.49× | 0.822 | 0.819 |
| gsm8k_bare_bones | 1,877 | 1,881 | 2,331 | 1.24× | 0.821 | 0.821 |
| ifeval_default | 7,890 | 7,898 | 15,135 | 1.92× | 0.442 | 0.455 |
| rollouts_1024 | 3,200 | 3,199 | 4,281 | 1.34× | | |
| rollouts_2048 | 3,048 | 3,049 | 4,194 | 1.38× | | |
| rollouts_4096 | 2,719 | 2,719 | 3,887 | 1.43× | | |
| rollouts_8192 | 2,209 | 2,211 | 3,345 | 1.51× | | |
| rollouts_16384 | 1,465 | 1,465 | 2,589 | 1.77× | | |
| few_blocks | 686 | 696 | 840 | 1.21× | | |
| multi_return_seq | 1,544 | 1,553 | 1,926 | 1.24× | | |

No perf regression at TP1 compared to main, and TP2 is faster on every benchmark (1.21× to 1.92×).
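
The Speedup column appears to be TP2 throughput divided by TP1 throughput on this branch. That reading is inferred from the reported numbers rather than stated in the PR, and the helper below is only an illustrative sketch of the arithmetic.

```python
def speedup(tp1_tok_s: float, tp2_tok_s: float) -> float:
    """Ratio of TP2 to TP1 throughput, rounded to two decimals as in the table."""
    return round(tp2_tok_s / tp1_tok_s, 2)


# Values taken from the gsm8k_default and ifeval_default rows
gsm8k_default_speedup = speedup(2454, 3657)
ifeval_default_speedup = speedup(7898, 15135)
```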

Tests

Added tests for TP; all tests run.

@remi-or remi-or requested a review from ArthurZucker May 7, 2026 09:32
@remi-or remi-or self-assigned this May 7, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker (Collaborator) left a comment

LGTM. Be careful: EP-only will fail if you divide the head dim by the TP size by default.

Comment on lines 166 to 168
# Account for TP: each KV head is dispatched to a different GPU, so the effective number of KV heads per GPU is
# simply divided by the TP size (number of GPUs)
if tp_size is not None and tp_size > 1:

Collaborator

Only if the attention k and v are targets of the TP plan, though.

Collaborator Author

What could be the other targets? Not familiar enough with the TP plan tbh

Collaborator Author

Ok, added a boolean `kv_is_tp = "layers.*.self_attn.k_proj" in config.tp_plan and "layers.*.self_attn.v_proj" in config.tp_plan` to condition this.
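
A minimal sketch of how that condition might gate the per-GPU KV head count discussed in this thread. The function name `kv_heads_per_gpu`, the divisibility check, and the `"colwise"` plan values in the usage lines are assumptions for illustration. Only the two `tp_plan` keys and the `tp_size` condition come from the PR.

```python
from typing import Dict, Optional


def kv_heads_per_gpu(num_kv_heads: int, tp_plan: Optional[Dict[str, str]], tp_size: Optional[int]) -> int:
    """Hypothetical helper: number of KV heads whose cache lives on a single GPU."""
    tp_plan = tp_plan or {}
    # Divide the KV heads only if both the K and V projections are targets of the TP plan
    kv_is_tp = "layers.*.self_attn.k_proj" in tp_plan and "layers.*.self_attn.v_proj" in tp_plan
    if kv_is_tp and tp_size is not None and tp_size > 1:
        # Assumption: an uneven split is treated as an error in this sketch
        if num_kv_heads % tp_size != 0:
            raise ValueError(f"{num_kv_heads} KV heads cannot be split evenly across {tp_size} GPUs")
        return num_kv_heads // tp_size
    # Without TP on K and V (for example an EP-only plan), every GPU keeps all KV heads
    return num_kv_heads


sharded_plan = {"layers.*.self_attn.k_proj": "colwise", "layers.*.self_attn.v_proj": "colwise"}
heads_with_tp = kv_heads_per_gpu(8, sharded_plan, tp_size=2)
heads_without_kv_tp = kv_heads_per_gpu(8, {}, tp_size=2)
```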

logit_processor: The [`ContinuousBatchingLogitsProcessorList`] object used to process the logits.
input_queue: Queue for incoming requests
input_queue: Queue for incoming requests. Is None if this process is not a TP driver.
cancel_queue: Queue for cancellation request_ids. Is None if this process is not a TP driver.

Collaborator

Okay will read the rest to see if all are drivers or not

Collaborator Author

?

Comment thread src/transformers/generation/continuous_batching/continuous_api.py Outdated
Comment thread src/transformers/generation/continuous_batching/utils.py Outdated
@remi-or remi-or added this pull request to the merge queue May 18, 2026
Merged via the queue into main with commit 1e0594f May 18, 2026
92 of 95 checks passed
@remi-or remi-or deleted the cb-tp2 branch May 18, 2026 01:09
jp1924 pushed a commit to jp1924/transformers that referenced this pull request May 18, 2026
* TP heads and DP / TP seeds

* Reproducible hash

* Add the notion of TP drivers

* Fix NCCL device

* Temporary fix for multiple streams

* Better handling of NCCL graph mixing

* Fix cfg

* nit

* Move the seed setting

* Reworked overall to have accuracy scoring

* Adding tests 1/n

* Added tests

* Style

* Fixes

* CC review

* Nits

* Renames

* Small fixes

* Move distributed stuff to a distributed file

* Docstring

* Final fixes

* Review compliance

* Review compliance 2

* Rebase fix

* Style

* Less redundant testing suite

* Fix TP plan

* Fix stopping condition

* Nits
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request May 28, 2026
kashif pushed a commit to kashif/transformers that referenced this pull request Jun 1, 2026
khushali9 pushed a commit to khushali9/transformers that referenced this pull request Jun 8, 2026
4 participants